The Influence Relevance Voter: An Accurate And Interpretable Virtual High Throughput Screening Method
Authors
S. Joshua Swamidass et al.
†Institute for Genomics and Bioinformatics, UCI; ‡School of Biological Sciences, UCI; §Universidad Nacional de Rosario
Abstract
Given activity training data from High-Throughput Screening (HTS) experiments, virtual High-Throughput Screening (vHTS) methods aim to predict in silico the activity of untested chemicals. We present a novel method, the Influence Relevance Voter (IRV), specifically tailored for the vHTS task. The IRV is a low-parameter neural network which refines a k-nearest neighbor classifier by non-linearly combining the influences of a chemical’s neighbors in the training set. Influences are decomposed, also non-linearly, into a relevance component and a vote component.

The IRV is benchmarked using the data and rules of two large, open competitions, and its performance is compared to that of other participating methods, as well as to that of an in-house Support Vector Machine (SVM) method. On these benchmark datasets, IRV achieves state-of-the-art results, comparable to the SVM in one case, and significantly better than the SVM in the other, retrieving three times as many actives in the top 1% of its prediction-sorted list. The IRV presents several other important advantages over SVMs and other methods: (1) the output predictions have probabilistic semantics; (2) the underlying inferences are interpretable; (3) the training time is very short, on the order of minutes even for very large data sets; (4) the risk of overfitting is minimal, due to the small number of free parameters; and (5) additional information can easily be incorporated into the IRV architecture. Combined with its performance, these qualities make the IRV particularly well suited for vHTS.

Virtual High-Throughput Screening (vHTS) is the cost-effective, in silico complement of experimental HTS. A vHTS algorithm uses data from HTS experiments to predict the activity of new sets of compounds in silico.
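The abstract describes the IRV as a small neural network that combines neighbor influences, each split into a relevance and a vote. A minimal Python sketch of one way such a forward pass could look is given below. The exact functional form is not stated in this section, so everything here is an assumption for illustration: the logistic squashing, the use of similarity and rank as relevance inputs, the product of relevance and vote, and all parameter names and values.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def irv_predict(neighbors, weights):
    """Score one query compound from its k nearest training neighbors.

    neighbors: list of (similarity, is_active) pairs, most similar first.
    weights: dict of hypothetical parameter names (assumed, not from the text).
    """
    total = weights["output_bias"]
    for rank, (similarity, is_active) in enumerate(neighbors, start=1):
        # Relevance: how much this neighbor should count, squashed into (0, 1).
        relevance = sigmoid(
            weights["relevance_bias"]
            + weights["similarity_weight"] * similarity
            + weights["rank_weight"] * rank
        )
        # Vote: one value per neighbor class (active or inactive).
        vote = weights["active_vote"] if is_active else weights["inactive_vote"]
        # Influence of this neighbor: relevance times vote (assumed form).
        total += relevance * vote
    # A final squashing gives an output that can be read as a probability.
    return sigmoid(total)


# Hand-picked illustrative weights; a real model would learn them from data.
example_weights = {
    "output_bias": -1.0,
    "relevance_bias": 0.0,
    "similarity_weight": 2.0,
    "rank_weight": -0.1,
    "active_vote": 1.5,
    "inactive_vote": -1.0,
}

mostly_active = [(0.9, True), (0.8, True), (0.4, False)]
mostly_inactive = [(0.9, False), (0.8, False), (0.4, True)]

print(irv_predict(mostly_active, example_weights))
print(irv_predict(mostly_inactive, example_weights))
```

With only six free parameters in this sketch, the model stays small regardless of the training set size, and each neighbor's relevance and vote can be inspected individually to see why a compound received its score.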
Although vHTS is sometimes cast as a classification task, it is more appropriately described as a ranking task, where the goal is to rank additional compounds such that active compounds are as close to the top of the prediction-sorted list as possible. The experiments required to verify a hit are expensive, so it is critical that true actives be recognized as early as possible. Accurately ordering actives by their degree of activity, however, is not critical. The vHTS task, therefore, differs from the “ranking” task of the machine learning literature, in that the goal is not to precisely order the chemicals in relation to each other, but rather to globally rank as many actives as possible above the bulk of the inactives. Furthermore, proper vHTS training data for the ranking task is often unavailable.

An important algorithm proposed for vHTS is the k-Nearest Neighbor (kNN) classifier, a nonparametric method which has been shown to be effective in a number of other problems [1,2]. In the kNN approach, each new data point is classified by integrating information from its neighborhood in the training set in a very simple way. Specifically, a new data point is assigned to the class occurring most frequently among its k closest structural neighbors in the training set.

While the kNN algorithm has been applied to chemical data, it does not perform optimally [3–6] because all the nearest neighbors contribute equally, regardless of their relative properties and similarities to the test chemical. Hence, important concepts (such as “the more similar a chemical is to its active neighbors, the more likely it is to be active itself” or “the closest neighbors should influence the prediction more than the furthest ones”) are not representable with a kNN. Furthermore, the kNN is usually used to classify, rather than to rank, unknown data.
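The kNN rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it assumes compounds are represented as sets of fingerprint feature ids and that structural similarity is measured with the Tanimoto coefficient, neither of which is specified in this section. The toy training set and the function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of feature ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_active_count(query, training, k):
    """Count the actives among the k training compounds most similar to query."""
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    return sum(1 for _, is_active in ranked[:k] if is_active)


def knn_classify(query, training, k):
    """Majority vote: predict active when more than half of the k neighbors are active."""
    return knn_active_count(query, training, k) * 2 > k


# Toy training set (assumed data): (fingerprint, is_active) pairs.
training = [
    ({1, 2, 3, 4}, True),
    ({1, 2, 3, 5}, True),
    ({1, 2, 6, 7}, False),
    ({8, 9, 10}, False),
    ({1, 9, 10, 11}, False),
]

query_a = {1, 2, 3, 4}  # identical to a known active
query_b = {1, 2, 3}     # only partly overlaps the known actives

for query in (query_a, query_b):
    print(knn_active_count(query, training, 3), knn_classify(query, training, 3))
```

Both queries end up with the same neighbor count and the same class, even though the first is identical to a known active and the second is not. Every neighbor counts equally, so the similarity values play no role once the k neighbors are chosen.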
The kNN output can be modified to be an integer between 0 and k by counting the number of neighbors that are active, instead of taking a binary majority vote, but even with this quantization many compounds are mapped to the same integer value and therefore cannot be properly ranked. This is a critical deficiency for vHTS, where economic or other reasons may dictate that only a few of the top hits be testable. kNN, even in its quantized version, does not provide a clear ranking of its top hits.

A number of researchers have attempted to rectify these deficiencies of kNN by employing alternate weighting schemes and decision rules [7–15]. In many domains, including vHTS [9,10], these modifications can substantially improve classification performance. With few exceptions [10,12,13], however, these modifications are either somewhat ad hoc, untuned to the nuances of each dataset, or do not produce probabilistic predictions.

Here we propose a novel vHTS method, the Influence Relevance Voter (IRV), which can also be viewed as an extension of the kNN algorithm. The IRV uses a neural network architecture [16–18] to learn how to best integrate information from the nearest structural neighbors contained in the training set. The IRV tunes itself to each dataset by a simple gradient descent learning procedure and produces continuous outputs that can be interpreted probabilistically and used to rank all the compounds.

We assess the performance of IRV on two benchmark datasets from two recent open data mining competitions. For comparison purposes, we also implement two other methods: MAX-SIM and Support Vector Machines (SVM). MAX-SIM is a particularly simple algorithm and a useful baseline for comparison. In contrast, SVMs are a highly sophisticated class of methods which have been successfully applied to other chemical classification problems [19,20] and are expected to yield high performance on vHTS datasets. Comparisons against the kNN method are not included
Related resources
Influence Relevance Voting: An Accurate And Interpretable Virtual High Throughput Screening Method
Given activity training data from high-throughput screening (HTS) experiments, virtual high-throughput screening (vHTS) methods aim to predict in silico the activity of untested chemicals. We present a novel method, the Influence Relevance Voter (IRV), specifically tailored for the vHTS task. The IRV is a low-parameter neural network which refines a k-nearest neighbor classifier by nonlinearly ...
Nootropic Medicinal Plants; Evaluating Potent Formulation By Novelestic High throughput Pharmacological Screening (HTPS) Method
The principle of this method was to screen the pharmacological activity of six prepared polyphyto formulations by using high throughput screening method for their nootropic action. The study was performed in three stages using one, two and three animals, respectively in a group. Test formulations were given p.o daily at the dose of 50 and 100 mg/kg body weight. The test formulations were compar...
SVM-Based Feature Selection for Characterization of Focused Compound Collections
Artificial neural networks, the support vector machine (SVM), and other machine learning methods for the classification of molecules are often considered as a "black box", since the molecular features that are most relevant for a given classifier are usually not presented in a human-interpretable form. We report on an SVM-based algorithm for the selection of relevant molecular features from a t...
Assessment of "drug-likeness" of a small library of natural products using chemoinformatics
Even though natural products have an excellent record as a source for new drugs, the advent of ultrahigh-throughput screening and large-scale combinatorial synthetic methods has caused a decline in the use of natural products research in the pharmaceutical industry. This is due to the efficiency in generating and screening a high number of synthetic combinatorial compounds; whereas traditional ...
Virtual screening with support vector machines and structure kernels
Support vector machines and kernel methods have recently gained considerable attention in chemoinformatics. They offer generally good performance for problems of supervised classification or regression, and provide a flexible and computationally efficient framework to include relevant information and prior knowledge about the data and problems to be handled. In particular, with kernel methods m...
Publication date: 2009